Ctrip User Room-Type Booking Prediction: A Case Study


2023-10-08 04:36 | Source: compiled from the web

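Source: barefootgirl - kesci.com. Original article: 携程用户预定房型预测 (Ctrip User Room-Type Booking Prediction), which can be run online without setting up an environment. Dataset download: 携程房型产品用户行为数据集 (Ctrip room-type product user behavior dataset).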

Problem Introduction

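The dataset records Ctrip users booking Ctrip room types. It has been anonymized and consists of three parts: user data, hotel data, and room-type data. The task is to mine each user's historical information for room-type preferences and predict which saleable room type (roomid) the user finally books.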

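This article first runs a simple analysis of the basic fields of the training set, then makes predictions for the problem the dataset describes.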

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    file = '../input/ctrip/competition_train.txt'

1. A simple analysis of the basic fields of the training set

First, 13 basic fields are extracted from the training set for analysis: orderid, uid, orderdate, hotelid, basicroomid, roomid, orderlabel, star, rank, returnvalue, price_deduct, basic_minarea, and basic_maxarea.

    basic_colnames = ['orderid', 'uid', 'orderdate', 'hotelid', 'basicroomid', 'roomid',
                      'orderlabel', 'star', 'rank', 'returnvalue', 'price_deduct',
                      'basic_minarea', 'basic_maxarea']
    # Read only the 13 basic columns from the tab-separated training file
    basic_data = pd.read_csv(file, sep='\t', header=0, usecols=basic_colnames)
    #basic_data.info()
    #basic_data['orderid'].value_counts()
    #basic_data['orderlabel'].value_counts()
    #basic_data['uid'].value_counts()
    #basic_data['hotelid'].value_counts()
    #basic_data['basicroomid'].value_counts()
    #basic_data['roomid'].value_counts()
    #basic_data['star'].value_counts()
    #basic_data['orderdate'].value_counts()

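- The training set has 7,724,875 records.
- There are 211,580 order IDs (orderid); ORDER_12473322 has the most records, 686.
- 211,580 orders were successfully booked (orderlabel=1), that is, one booked record per order.
- There are 187,162 users (uid); USER_615018 has the most records, 2,105.
- There are 51,209 hotels (hotelid); HOTEL_135032 has the most records, 34,376.
- There are 277,373 physical room types (basicroomid); BASIC_205234 has the most records, 6,589.
- There are 2,924,400 saleable room types (roomid); ROOM_18387063 has the most records, 204.
- Hotel star level (star) takes only four values, 5, 7, 9, and 11, and the record count rises in that order.
- Order dates span one week, from April 14 to April 20, 2013.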
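The counts above come from `nunique()` and `value_counts()` style queries (the commented-out lines in the code). A minimal sketch on a made-up toy frame, where the IDs and the helper name `summarize` are illustrative assumptions and not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for basic_data; the IDs below are made up for illustration.
toy = pd.DataFrame({
    "orderid": ["ORDER_1", "ORDER_1", "ORDER_1", "ORDER_2", "ORDER_2"],
    "uid": ["USER_1", "USER_1", "USER_1", "USER_2", "USER_2"],
    "roomid": ["ROOM_1", "ROOM_2", "ROOM_3", "ROOM_1", "ROOM_4"],
    "orderlabel": [0, 1, 0, 1, 0],
})


def summarize(df, col):
    """Return (number of distinct values, most frequent value, its row count)."""
    counts = df[col].value_counts()  # sorted by count, descending
    return df[col].nunique(), counts.index[0], int(counts.iloc[0])


n_rows = len(toy)
n_booked = int((toy["orderlabel"] == 1).sum())
order_summary = summarize(toy, "orderid")
room_summary = summarize(toy, "roomid")
print(n_rows, n_booked, order_summary, room_summary)
```

Running the same helper over each ID column of the real `basic_data` would give figures of the kind listed above.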

basic_data.describe()

The summary reveals anomalies: basic_minarea and basic_maxarea both have missing values, and their minimum is below 0, which is not a plausible room area.

    #basic_data.isnull().any()
    #basic_data[train_data['basic_minarea'] [...]
    # [...] the source is truncated here; the statement that first defines basic_data1 ends with "0)]"
    basic_data1 = basic_data1[basic_data1['orderlabel']==1]  # keep successfully booked rows
    #basic_data1.info()
    #basic_data1.describe()
    #basic_data1['uid'].value_counts()
    #basic_data1['hotelid'].value_counts()
    #basic_data1['basicroomid'].value_counts()
    #basic_data1['roomid'].value_counts()
    #basic_data1['star'].value_counts()
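Because part of the cleaning code did not survive, the exact filter the author used is unknown. The sketch below shows one plausible rule consistent with the description (drop rows whose area fields are missing or not positive, then keep booked rows). The rule itself, the toy data, and the helper name `clean_booked` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for basic_data with one missing and one negative area value.
toy = pd.DataFrame({
    "orderid": ["ORDER_1", "ORDER_1", "ORDER_2", "ORDER_3", "ORDER_4"],
    "orderlabel": [1, 0, 1, 1, 1],
    "basic_minarea": [20.0, 25.0, np.nan, -1.0, 30.0],
    "basic_maxarea": [25.0, 30.0, 40.0, -1.0, 35.0],
})

area_cols = ["basic_minarea", "basic_maxarea"]


def clean_booked(df):
    """Keep booked rows whose area fields are present and positive (assumed rule)."""
    has_area = df[area_cols].notna().all(axis=1)
    positive_area = (df[area_cols] > 0).all(axis=1)
    return df[has_area & positive_area & (df["orderlabel"] == 1)]


cleaned = clean_booked(toy)
print(cleaned["orderid"].tolist())
```

Only ORDER_1 and ORDER_4 survive here: one row is not a booking, one has a missing area, and one has a negative area.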

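After cleaning:

- 179,667 valid records of successful bookings remain.
- There are 160,471 users (uid); USER_609501 booked most often, 48 times.
- There are 41,837 hotels (hotelid); HOTEL_132727 was booked most often, 225 times.
- There are 76,861 physical room types (basicroomid); BASIC_463407 was booked most often, 209 times.
- There are 134,512 saleable room types (roomid); ROOM_22089993 was booked most often, 76 times.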

Price distribution of booked room types

    plt.figure(figsize=(15, 5))
    plt.subplot(1,2,1)
    plt.hist(basic_data1['price_deduct'])
    plt.xlabel('Price')
    plt.ylabel('Number of orders')
    plt.title('Price distribution of room-type orders')
    # Prices are mostly below 5000, so look at the distribution below 5000
    price_data = basic_data1[basic_data1['price_deduct'] [...]

    # [...] the source is truncated here; the text resumes partway through the feature-engineering code.
    # df_median, df_min, df_min_orderid, df_rank_mean, df_roomrank_mean and merge_mean
    # are helper functions whose definitions are not included in the surviving text.
    [...] 2 else x-1)
    all["roomservice_3"] = all["roomservice_3"].apply(lambda x: 1 if x > 0 else 0)
    # Flag whether each room service matches the user's most frequent choice
    for i in range(2, 9):
        all["service_equal_%s" % i] = list(map(lambda x, y: 1 if x == y else 0,
                                               all["roomservice_%s" % i],
                                               all["user_roomservice_%s_max" % i]))
    del all["user_roomservice_2_0ratio"]
    del all["user_roomservice_3_0ratio"]
    del all["user_roomservice_5_0ratio"]
    del all["user_roomservice_7_1ratio"]

    # Add conversion-rate features
    # Conversion rate per basicroomid
    feature_df = all[["orderid", "basicroomid", "orderlabel"]].copy()
    # sort_values returns a new frame, so the result must be assigned;
    # with keep="last" below, the booked row (orderlabel=1) is kept when there is one
    feature_df = feature_df.sort_values("orderlabel")
    feature_df = feature_df.drop_duplicates(["orderid", "basicroomid"], keep="last")
    basicroom_mean = pd.DataFrame(feature_df.groupby("basicroomid").orderlabel.mean()).reset_index()
    basicroom_mean.columns = ["basicroomid", "basicroomid_mean"]
    basicroom_sum = pd.DataFrame(feature_df.groupby("basicroomid").orderlabel.sum()).reset_index()
    basicroom_sum.columns = ["basicroomid", "basicroomid_sum"]
    all = all.merge(basicroom_mean, on="basicroomid", how="left").fillna(0)
    all = all.merge(basicroom_sum, on="basicroomid", how="left").fillna(0)

    all = df_median(all)
    all = df_min(all)
    all = df_min_orderid(all)
    # Price ranks within each (orderid, basicroomid) group and within each order
    all["basicroomid_price_rank"] = all['price_deduct'].groupby([all['orderid'], all['basicroomid']]).rank()
    all["orderid_price_deduct_min_rank"] = all['orderid_price_deduct_min'].groupby(all['orderid']).rank()
    all = df_rank_mean(all)
    all = df_roomrank_mean(all)
    all = merge_mean(all, ["basicroomid"], "basic_week_ordernum_ratio", "basic_week_ordernum_ratio_mean")
    all = merge_mean(all, ["basicroomid"], "basic_recent3_ordernum_ratio", "basic_recent3_ordernum_ratio_mean")
    all = merge_mean(all, ["basicroomid"], "basic_comment_ratio", "basic_comment_ratio_mean")
    all = merge_mean(all, ["basicroomid"], "basic_30days_ordnumratio", "basic_30days_ordnumratio_mean")
    all = merge_mean(all, ["basicroomid"], "basic_30days_realratio", "basic_30days_realratio_mean")
    all = merge_mean(all, ["roomid"], "room_30days_ordnumratio", "room_30days_ordnumratio_mean")
    all = merge_mean(all, ["roomid"], "room_30days_realratio", "room_30days_realratio_mean")

    # Ratio and difference features between room prices and the user's history
    all["city_num"] = all["user_ordernum"]/all["user_citynum"]
    all["area_price"] = all["user_avgprice"]/all["user_avgroomarea"]
    all["price_max_min_rt"] = all["user_maxprice"]/all["user_minprice"]
    all["basicroomid_price_deduct_min_minprice_rt"] = all["basicroomid_price_deduct_min"]/all["user_minprice"]
    all["price_dif"] = all["basicroomid_price_deduct_min"]-all["price_deduct"]
    all["price_dif_rt"] = all["basicroomid_price_deduct_min"]/all["price_deduct"]
    all["price_dif_hotel"] = all["orderid_price_deduct_min"]-all["price_deduct"]
    all["price_dif_hotel_rt"] = all["orderid_price_deduct_min"]/all["price_deduct"]
    all["order_basic_minprice_dif"] = all["basicroomid_price_deduct_min"]-all["orderid_price_deduct_min"]
    all["order_basic_minprice_rt"] = all["basicroomid_price_deduct_min"]/all["orderid_price_deduct_min"]
    # Last digit of the price (flagged when it is 4 or 7) and last two digits
    all["price_tail1"] = all["price_deduct"]%10
    all["price_tail1"] = list(map(lambda x: 1 if x==4 or x==7 else 0, all["price_tail1"]))
    all["price_tail2"] = all["price_deduct"]%100
    all["price_ori"] = list(map(lambda x, y: x+y, all["price_deduct"], all["returnvalue"]))

    # Convert ratios back into counts by multiplying with the matching order totals
    for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]:
        all["ordertype_%s_num" % i] = list(map(lambda x, y: x*y, all["ordertype_%s_ratio" % i], all["user_ordernum"]))
        del all["ordertype_%s_ratio" % i]
    # All orders
    for c in ["orderbehavior_1_ratio", "orderbehavior_2_ratio", "orderbehavior_6_ratio", "orderbehavior_7_ratio"]:
        all[c] = list(map(lambda x, y: x*y, all[c], all["user_ordernum"]))
    # One week
    for c in ["orderbehavior_3_ratio_1week", "orderbehavior_4_ratio_1week", "orderbehavior_5_ratio_1week"]:
        all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_1week"]))
    # One month
    #for c in ["orderbehavior_3_ratio_1month", "orderbehavior_4_ratio_1month", "orderbehavior_5_ratio_1month"]:
    #    all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_1month"]))
    # Three months
    for c in ["orderbehavior_3_ratio_3month", "orderbehavior_4_ratio_3month", "orderbehavior_5_ratio_3month"]:
        all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_3month"]))

    all["price_star"] = all["price_deduct"]/(all["star"]-1)
    all["star_dif"] = all["user_avgstar"]-all["star"]
    all["price_ave_dif_rt"] = all["price_deduct"]/all["user_avgdealprice"]
    all["price_ave_star_dif"] = all["price_deduct"]/all["user_avgprice_star"]
    all["price_h_w_rt"] = all["user_avgdealpriceholiday"] / all["user_avgdealpriceworkday"]
    all["price_ave_dif"] = all["price_deduct"] - all["user_avgdealprice"]
    all["user_roomservice_4_32_rt"] = all["user_roomservice_4_3ratio"] / all["user_roomservice_4_2ratio"]
    all["user_roomservice_4_43_rt"] = all["user_roomservice_4_4ratio"] / all["user_roomservice_4_3ratio"]

4. Building the model
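Since the model's inputs include the conversion-rate and price-rank features built in the previous step, a small worked example may help before fitting. The sketch below reproduces both ideas on made-up toy data. One deliberate difference, which is an assumption of this sketch and not what the code above does: the conversion rate is computed from training orders only, so that labels of held-out orders do not flow into their own features. The column name `price_rank_in_order` is also illustrative.

```python
import pandas as pd

# Toy data: three orders, each listing several candidate rooms.
toy = pd.DataFrame({
    "orderid":      ["O1", "O1", "O1", "O2", "O2", "O3", "O3"],
    "basicroomid":  ["B1", "B1", "B2", "B1", "B2", "B2", "B3"],
    "price_deduct": [300, 280, 450, 310, 400, 420, 500],
    "orderlabel":   [0, 1, 0, 0, 1, 1, 0],
})

# Assumption: O1 and O2 are training orders, O3 is held out.
train = toy[toy["orderid"].isin(["O1", "O2"])]

# One row per (orderid, basicroomid): 1 if any room of that physical type was booked.
# This matches sorting by orderlabel and keeping the last duplicate.
booked = train.groupby(["orderid", "basicroomid"])["orderlabel"].max().reset_index()

# Conversion rate per basicroomid: share of orders showing it in which it was booked.
rate = (booked.groupby("basicroomid")["orderlabel"].mean()
        .rename("basicroomid_mean").reset_index())

features = toy.merge(rate, on="basicroomid", how="left")
# Room types never seen in training get a rate of 0, as in the original fillna(0).
features["basicroomid_mean"] = features["basicroomid_mean"].fillna(0)

# Rank of each room's price within its order (1 = cheapest).
features["price_rank_in_order"] = features.groupby("orderid")["price_deduct"].rank()
print(features)
```

B1 and B2 each appear in two training orders and are booked in one of them, so both get a rate of 0.5, while B3 is unseen in training and falls back to 0.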

This project uses LightGBM (lgb). The first 400,000 rows are used for training.

    import lightgbm as lgb  # not shown in the surviving text; needed for the lgb.* calls below

    train = all.iloc[:400000]
    test = all.iloc[400000:]
    # Algorithm test
    train_data = train.copy()
    train_y = train_data["orderlabel"].values
    del train_data["orderlabel"]
    # LightGBM
    train_data1 = lgb.Dataset(train_data, label=train_y)
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'binary_logloss',
        'min_child_weight': 1.5,
        'num_leaves': 2 ** 5,
        'lambda_l2': 10,
        'subsample': 0.7,
        'colsample_bytree': 0.7,
        'colsample_bylevel': 0.7,  # XGBoost-style name; not among LightGBM's documented parameters
        'learning_rate': 0.05,
        'tree_method': 'exact',    # XGBoost-style name; not among LightGBM's documented parameters
        'seed': 2019,
        'nthread': 12}
    num_round = 500
    model = lgb.train(params, train_data1, num_round)

5. Prediction results

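100,000 rows are used for testing. The model predicts, for each row, the probability that orderlabel is 1. The rows are then grouped by orderid, and within each order the roomid with the highest predicted probability is taken as the final prediction.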

    test_data = test.copy()
    del test_data['orderlabel']
    test_result = model.predict(test_data.values)
    test_result = pd.DataFrame(test_result)
    test_result.columns = ["prob"]
    test_result["orderid"] = test_data["orderid"].values
    test_result["pre_roomid"] = test_data["roomid"].values
    # Sort descending so the highest-probability row comes first within each order
    result = test_result.sort_values(by=['orderid', "prob"], ascending=False)
    result = result.drop_duplicates("orderid", keep="first")
    # Predicted room type per order
    test_predict = result.pivot_table(index='orderid', values='pre_roomid').copy()
    # Actual room type per order
    test_data_tmp = test[test['orderlabel'] == 1][['orderid', 'roomid']]
    test_truth = test_data_tmp.pivot_table(index='orderid', values='roomid').copy()
    # i is the total number of orderids in the test set; j is the number predicted correctly.
    i = 0; j = 0
    orderid = test_data_tmp['orderid']
    for k in orderid:
        i = i+1
        if test_predict['pre_roomid'][k] == test_truth['roomid'][k]:
            j = j+1
    print(i, j)

Output:

    2924 1452

The test set contains 2,924 orderids, of which the booked room type is predicted correctly for 1,452, an accuracy of 1452 / 2924 ≈ 49.66%.
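The per-order evaluation can also be written without an explicit loop. The following is a minimal sketch on made-up scores, not the author's code: the toy frame, the function name `top1_accuracy`, and the assumption that every order has exactly one booked row are all illustrative.

```python
import pandas as pd

# Toy scored candidates: one row per (order, room) with a predicted probability.
scored = pd.DataFrame({
    "orderid":    ["O1", "O1", "O1", "O2", "O2", "O3", "O3"],
    "roomid":     [11, 12, 13, 21, 22, 31, 32],
    "prob":       [0.2, 0.7, 0.1, 0.6, 0.3, 0.4, 0.5],
    "orderlabel": [0, 1, 0, 0, 1, 0, 1],
})


def top1_accuracy(df):
    """Return (hits, orders): orders whose highest-probability room is the booked one.

    Assumes each order has exactly one row with orderlabel == 1.
    """
    # Row index of the highest-probability candidate within each order.
    best_rows = df.loc[df.groupby("orderid")["prob"].idxmax()]
    predicted = best_rows.set_index("orderid")["roomid"]
    truth = df[df["orderlabel"] == 1].set_index("orderid")["roomid"]
    hits = predicted.reindex(truth.index) == truth
    return int(hits.sum()), len(truth)


hits, total = top1_accuracy(scored)
print(hits, total, round(hits / total, 4))
```

Here the top-ranked room matches the booking for O1 and O3 but not for O2, so two of three orders count as correct.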


